Analysis of weather change and patterns for New York State, January 1, 2015 to December 31, 2020¶

Authors: Ravi Konjeti, Lakshmi Bhavani, and Aleksandar Kamenev

Data Dictionary¶

Weather data for New York State from 2015 through 2020 inclusive, from 5 stations across the state.

source: https://www.ncdc.noaa.gov/cdo-web/datasets

- STATION: Station identification code
- NAME: Name of the station (usually city/airport name)
- LATITUDE: Latitude (decimal degrees; northern-hemisphere values > 0, southern-hemisphere values < 0)
- LONGITUDE: Longitude (decimal degrees; western-hemisphere values < 0, eastern-hemisphere values > 0)
- ELEVATION: Elevation above mean sea level (meters)
- DATE: Year of the record (4 digits) followed by month (2 digits) and day (2 digits)
- AWND: Average daily wind speed
- PRCP: Precipitation (inches)
- SNOW: Snowfall (inches)
- SNWD: Snow depth (inches)
- TAVG: Average temperature (°F)
- TMAX: Maximum temperature (°F)
- TMIN: Minimum temperature (°F)
- WDF2: Direction of fastest 2-minute wind (degrees)
- WSF2: Fastest 2-minute wind speed

Please see http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt for a complete list of stations and their metadata.

1. Import libraries and load dataset¶

Imports¶

In [1]:
import pandas as pd
import edapy.datainfo as d_info
import edapy.load_data as d_load
import edapy.plot as my_plot
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Load data¶

In [2]:
# load in data 
weather_df = d_load.load_df(r'data\dataset.csv')
In [3]:
# instantiate custom package class for eda 
data = d_info.DataInfo(weather_df)

The method called below shows the first 5 records in the df so we can get a feel for the data. It also lists the features present, along with each feature's name, number of non-null values, and data type.
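The `describe` method belongs to our custom `edapy` package; a minimal sketch of what it presumably does with plain pandas (assuming a DataFrame `df`; the actual implementation may differ):

```python
import pandas as pd

def describe(df: pd.DataFrame) -> None:
    """Show the first 5 records, numeric summary stats, and dtype/non-null info."""
    print(df.head())      # first 5 rows to get a feel for the data
    print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
    print(df.info())      # feature names, non-null counts, and dtypes
```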

In [4]:
data.describe()
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND PRCP SNOW SNWD TAVG TMAX TMIN WDF2 WSF2
0 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-01 24.61 0.00 0.3 1.2 23.0 32.0 20.0 240 40.0
1 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-02 13.20 0.00 0.0 1.2 31.0 33.0 22.0 250 29.1
2 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-03 8.50 0.58 0.0 0.0 27.0 48.0 22.0 60 15.0
3 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-04 17.67 0.52 0.0 0.0 44.0 55.0 28.0 250 38.0
4 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-05 19.24 0.00 0.1 0.0 20.0 28.0 9.0 270 31.1
LATITUDE LONGITUDE ELEVATION AWND PRCP SNOW SNWD TAVG TMAX TMIN WDF2 WSF2
count 10960.000000 10960.000000 10960.000000 10957.000000 10959.000000 10960.000000 10960.000000 10593.000000 10959.000000 10959.000000 10960.000000 10960.000000
mean 41.865404 -75.076166 163.080000 9.320228 0.118639 0.167527 0.736168 51.274615 59.714755 43.197737 220.053832 19.880648
std 0.969652 2.071932 177.556465 4.057914 0.292714 1.045389 2.729762 18.206960 19.317414 17.991548 94.062662 6.596222
min 40.639150 -78.736060 2.700000 0.450000 0.000000 0.000000 0.000000 -6.000000 1.000000 -18.000000 10.000000 4.900000
25% 40.793890 -75.979930 25.400000 6.490000 0.000000 0.000000 0.000000 37.000000 44.000000 30.000000 170.000000 15.000000
50% 42.206780 -73.799130 85.400000 8.720000 0.000000 0.000000 0.000000 52.000000 61.000000 43.000000 230.000000 19.900000
75% 42.747220 -73.763900 216.200000 11.410000 0.070000 0.000000 0.000000 67.000000 77.000000 59.000000 300.000000 23.900000
max 42.939980 -73.101810 485.700000 29.970000 4.020000 31.200000 39.000000 89.000000 99.000000 82.000000 360.000000 67.100000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10960 entries, 0 to 10959
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   STATION    10960 non-null  object 
 1   NAME       10960 non-null  object 
 2   LATITUDE   10960 non-null  float64
 3   LONGITUDE  10960 non-null  float64
 4   ELEVATION  10960 non-null  float64
 5   DATE       10960 non-null  object 
 6   AWND       10957 non-null  float64
 7   PRCP       10959 non-null  float64
 8   SNOW       10960 non-null  float64
 9   SNWD       10960 non-null  float64
 10  TAVG       10593 non-null  float64
 11  TMAX       10959 non-null  float64
 12  TMIN       10959 non-null  float64
 13  WDF2       10960 non-null  int64  
 14  WSF2       10960 non-null  float64
dtypes: float64(11), int64(1), object(3)
memory usage: 1.3+ MB
None

Observation 1.01¶

From the above information, some key details about the dataset can be observed.

  1. The features are mostly numeric, but a few are stored as object type.

    • STATION & NAME are strong candidates for conversion to string and could in principle be one-hot encoded as categoricals. However, these values will only be used to identify stations and to split the data; no predictions will be made about which station a data point comes from, as this is not of interest to this research. We will therefore not use these two features beyond splitting and differentiating between stations when comparing other features.

    • DATE is a good candidate for conversion to date-time format and will serve as the basis of comparison in the time series analysis.

  2. The total number of features is 15, the total number of records is 10960, and some missing records are present. More about missing records in the next section.
In [5]:
# to date-time format
data.to_date_time("DATE")
# While we could convert to string, two questions arise: is it needed, and would one-hot encoding be more beneficial?
# data.as_string("STATION", "NAME")
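Presumably `to_date_time` wraps `pandas.to_datetime`; a minimal sketch of the equivalent conversion:

```python
import pandas as pd

df = pd.DataFrame({"DATE": ["2015-01-01", "2015-01-02", "2020-12-31"]})
df["DATE"] = pd.to_datetime(df["DATE"])  # object -> datetime64[ns]
print(df["DATE"].dtype)
```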

2. Check for missing values¶

The method called below gives detailed information about the number of missing and present values per feature in the dataset, and also lists the possible values each feature may take along with the number of times they occur. More on this will follow later in the notebook when we visualize the data per feature looking for outliers.
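A sketch of what the `nulls` helper presumably does under the hood, built from standard pandas calls (the actual `edapy` implementation may differ):

```python
import numpy as np
import pandas as pd

def nulls(df: pd.DataFrame) -> None:
    """Report missing and present counts per feature, then per-feature value counts."""
    print("Show number of missing instances per feature")
    print(df.isnull().sum())
    print("Show number of present instances per feature")
    print(df.notnull().sum())
    print("Shape", df.shape)
    for col in df.columns:
        print(f"Feature: {col}")
        print("Number of unique values:", df[col].nunique())
        print(df[col].value_counts())
```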

In [6]:
data.nulls()
Show number of missing instances per feature 
--------------------
STATION        0
NAME           0
LATITUDE       0
LONGITUDE      0
ELEVATION      0
DATE           0
AWND           3
PRCP           1
SNOW           0
SNWD           0
TAVG         367
TMAX           1
TMIN           1
WDF2           0
WSF2           0
dtype: int64
Show number of present instances per feature 
--------------------
STATION      10960
NAME         10960
LATITUDE     10960
LONGITUDE    10960
ELEVATION    10960
DATE         10960
AWND         10957
PRCP         10959
SNOW         10960
SNWD         10960
TAVG         10593
TMAX         10959
TMIN         10959
WDF2         10960
WSF2         10960
dtype: int64
Shape (10960, 15)

** ** ** ** ** ** ** ** ** ** 

Feature: STATION

Number of unique values: 5 

['USW00014733' 'USW00014735' 'USW00004725' 'USW00004781' 'USW00094789']
_ _ _ _ _ _ _ _ _ _ _ _ _ _

USW00014733    2192
USW00014735    2192
USW00004725    2192
USW00004781    2192
USW00094789    2192
Name: STATION, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: NAME

Number of unique values: 5 

['BUFFALO NIAGARA INTERNATIONAL, NY US'
 'ALBANY INTERNATIONAL AIRPORT, NY US' 'BINGHAMTON GREATER AP, NY US'
 'ISLIP LI MACARTHUR AIRPORT, NY US' 'JFK INTERNATIONAL AIRPORT, NY US']
_ _ _ _ _ _ _ _ _ _ _ _ _ _

BUFFALO NIAGARA INTERNATIONAL, NY US    2192
ALBANY INTERNATIONAL AIRPORT, NY US     2192
BINGHAMTON GREATER AP, NY US            2192
ISLIP LI MACARTHUR AIRPORT, NY US       2192
JFK INTERNATIONAL AIRPORT, NY US        2192
Name: NAME, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: LATITUDE

Number of unique values: 5 

[42.93998 42.74722 42.20678 40.79389 40.63915]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

42.93998    2192
42.74722    2192
42.20678    2192
40.79389    2192
40.63915    2192
Name: LATITUDE, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: LONGITUDE

Number of unique values: 5 

[-78.73606 -73.79913 -75.97993 -73.10181 -73.7639 ]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

-78.73606    2192
-73.79913    2192
-75.97993    2192
-73.10181    2192
-73.76390    2192
Name: LONGITUDE, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: ELEVATION

Number of unique values: 5 

[216.2  85.4 485.7  25.4   2.7]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

216.2    2192
85.4     2192
485.7    2192
25.4     2192
2.7      2192
Name: ELEVATION, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: DATE

Number of unique values: 2192 

['2015-01-01T00:00:00.000000000' '2015-01-02T00:00:00.000000000'
 '2015-01-03T00:00:00.000000000' ... '2020-12-29T00:00:00.000000000'
 '2020-12-30T00:00:00.000000000' '2020-12-31T00:00:00.000000000']
_ _ _ _ _ _ _ _ _ _ _ _ _ _

2015-01-01    5
2019-01-04    5
2018-12-29    5
2018-12-30    5
2018-12-31    5
             ..
2016-12-25    5
2016-12-24    5
2016-12-23    5
2016-12-22    5
2020-12-31    5
Name: DATE, Length: 2192, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: AWND

Number of unique values: 123 

[24.61 13.2   8.5  17.67 19.24 10.74 16.55 19.46 18.34 11.86 10.96  8.05
  7.83 16.33 13.65  9.84 17.22 12.3   4.92  4.7  14.09  8.95 13.42  6.93
 10.51  9.17 14.54 11.18  6.26 15.66  3.8  12.97 10.29  8.28 10.07 12.53
  5.82  9.62 15.21 14.76  5.59 11.41  6.04 14.32 11.63  6.71 13.87 16.11
  4.03  7.16 17.    4.47 21.47  7.61 18.12 19.01  9.4   8.72  5.37  3.58
  7.38 12.08  5.14  6.49 12.75  3.13  4.25 23.71 19.69 21.92 17.9  20.8
  3.36 18.79 15.88 14.99 15.43  2.91 17.45 19.91 20.36 21.03 20.13 16.78
  2.68 22.37 22.15 25.72 21.25  2.46 18.57 29.3  26.4  25.28 25.95  1.79
 22.82 23.04  2.01  1.57  1.12  0.89  2.24  1.34  0.45 20.58  0.67   nan
 24.83 23.49 21.7  23.94 27.74 25.05 23.26 24.16 25.5  28.19 29.97 28.41
 22.59 27.51 24.38]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

8.50     372
6.71     360
5.82     346
8.05     320
7.61     319
        ... 
28.19      1
25.50      1
27.74      1
0.45       1
24.38      1
Name: AWND, Length: 122, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: PRCP

Number of unique values: 210 

[0.   0.58 0.52 0.02 0.03 0.12 0.4  0.08 0.34 0.04 0.1  0.26 0.01 0.23
 0.42 0.35 0.05 0.07 0.3  0.14 0.22 0.09 0.17 0.31 0.39 0.19 0.28 0.37
 0.41 0.2  0.27 0.13 0.49 2.44 0.53 0.8  0.57 0.25 0.11 0.46 1.72 0.06
 0.43 0.38 0.65 0.16 0.44 2.85 0.81 1.85 1.   0.36 0.76 0.18 1.64 0.15
 0.33 0.73 0.51 0.24 0.66 0.89 0.94 0.29 0.32 0.71 1.04 0.72 1.54 1.76
 1.1  1.3  0.54 0.83 0.21 0.86 0.69 0.6  0.64 1.58 1.95 1.07 1.29 1.37
 1.79 0.77 2.29 1.27 1.25 0.67 0.82 0.63 1.81 0.91 2.01 0.5  0.59 1.33
 0.45 0.87 0.88 1.32 1.4  1.15 0.47 1.46 0.55 0.48 0.68 0.9  1.12 1.22
 0.74 2.14 0.92 1.18 1.43 1.13 0.78 0.95 0.56 0.62 1.03 1.21 0.61 0.79
 1.48 1.62 1.84 1.57 1.11 2.74 0.7  1.2  2.   2.55 1.08 1.55 0.75 0.96
 1.06 1.24 1.05 0.97 2.26 1.16  nan 1.28 1.59 0.98 1.6  1.63 0.99 0.84
 3.92 1.02 1.38 1.73 1.91 1.23 1.01 1.17 1.65 1.36 0.85 2.18 1.09 1.35
 1.51 1.34 1.49 1.47 2.7  1.66 1.45 1.19 1.39 0.93 1.8  1.53 2.02 1.5
 1.69 1.44 1.41 2.3  2.13 1.68 1.92 1.87 4.02 3.11 1.7  1.52 1.26 1.61
 2.05 2.75 2.48 2.9  2.47 1.42 1.94 2.07 1.78 1.14 2.84 2.16 1.71 2.33]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

0.00    6572
0.01     488
0.02     328
0.03     227
0.04     183
        ... 
2.02       1
1.44       1
1.41       1
2.30       1
2.33       1
Name: PRCP, Length: 209, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: SNOW

Number of unique values: 102 

[ 0.3  0.   0.1  1.   0.8  3.8  6.5  2.9  3.6  0.2  1.6  0.9  1.8  0.6
  4.4  8.2  0.7  5.4  2.3  1.1  2.4  2.7  1.4  0.5  6.9  2.1  0.4  1.5
  1.9  1.2  2.2  1.3  7.4  3.3  5.8  4.9  8.9  2.8  3.9  9.6  6.3  3.4
  5.6  3.2 13.2  1.7  2.6  4.   5.   2.   3.1  2.5  3.7  3.5  8.6  5.7
  6.2 17.2  6.4 13.6  4.7  8.7  4.5  4.1 18.4 11.9  3.   6.1  4.6  4.3
 11.2  7.6 17.   5.5  8.4 11.5 10.4  4.8 13.3  6.8 19.7  8.8 10.2 10.5
 31.2 12.5  8.   7.2  5.1 26.4  7.5 17.4  5.3 23.4  9.8  9.7 14.3 16.
 14.9  4.2 30.3  8.3]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

0.0     9903
0.1      178
0.2       96
0.3       73
0.6       49
        ... 
10.4       1
13.3       1
6.8        1
4.7        1
8.3        1
Name: SNOW, Length: 102, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: SNWD

Number of unique values: 31 

[ 1.2  0.   2.   9.8  7.1  9.1  7.9  5.1  3.1  3.9  5.9 13.  16.1 18.1
 20.9 22.  22.8 26.  25.2 24.  20.1 14.2 11.8 11.  15.  18.9 16.9 39.
 31.1 29.1 28. ]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

0.0     9514
1.2      328
2.0      225
3.1      170
3.9      116
5.1      111
5.9       75
7.9       67
7.1       64
9.1       45
9.8       31
11.8      27
13.0      25
11.0      24
22.0      17
16.1      16
14.2      14
18.1      14
15.0      14
20.9      12
20.1      11
16.9      11
18.9      10
22.8       6
26.0       4
25.2       3
24.0       2
39.0       1
31.1       1
29.1       1
28.0       1
Name: SNWD, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: TAVG

Number of unique values: 97 

[23. 31. 27. 44. 20. 12. 11.  8. 16. 21. 30. 10.  7. 18. 25. 15. 39. 24.
 26. 13. 14. 17.  2. -1. -5.  6. -4. 19.  5.  3. 22. 33.  9. 32. 36. 35.
 42. 37. 28. 38. 34. 49. 47. 40. 53. 45. 60. 51. 52. 57. 43. 54. 55. 58.
 65. 62. 66. 72. 76. 74. 73. 61. 59. 64. 68. 63. 70. 71. 67. 77. 48. 56.
 69. 75. 79. 78. 50. 46. 29. nan 41. 81. 80. 82.  0.  4.  1. 85. 88. 84.
 83. -2. -3. -6. 87. 86. 89.]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

 71.0    272
 69.0    258
 72.0    244
 67.0    229
 74.0    226
        ... 
 88.0      2
-5.0       2
-3.0       1
-6.0       1
 89.0      1
Name: TAVG, Length: 96, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: TMAX

Number of unique values: 99 

[32. 33. 48. 55. 28. 18. 17. 15. 19. 31. 21. 29. 30. 41. 44. 23. 22. 38.
 27. 20. 13.  2.  5. 10. 12.  1. 24. 16. 26. 34. 25. 47. 51. 37. 45. 35.
 42. 39. 46. 67. 52. 40. 36. 58. 50. 65. 66. 61. 78. 60. 71. 57. 70. 69.
 68. 79. 76. 80. 88. 89. 81. 82. 53. 83. 63. 75. 85. 62. 54. 73. 84. 77.
 74. 72. 64. 87. 86. 91. 59. 56. 49. 43. 11. 90. 92. 14.  9.  4. 93.  8.
  7. 94. 98. 97. 95. 96.  6. nan 99.]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

81.0    242
80.0    238
79.0    236
83.0    235
82.0    220
       ... 
1.0       2
7.0       2
98.0      2
5.0       2
2.0       1
Name: TMAX, Length: 98, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: TMIN

Number of unique values: 96 

[ 20.  22.  28.   9.  10.   2.   5.  14.  15.  -4.   7.   1.  33.  12.
  23.  24.  13.   8.   4.   3.   0.   6.  -5. -10.  -6.  -8.  -7.  -1.
  25.  19.  27.  29.  31.  35.  30.  16.  26.  18.  17.  32.  37.  39.
  34.  46.  38.  40.  43.  44.  42.  47.  51.  54.  53.  59.  64.  67.
  49.  41.  50.  60.  65.  62.  63.  58.  45.  56.  61.  57.  69.  55.
  52.  71.  72.  66.  68.  70.  48.  36.  21.  11. -12.  74.  75.  76.
  -3.  73.  -2.  -9. -13.  77. -18.  nan  78.  79.  80.  82.]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

 32.0    243
 30.0    235
 63.0    233
 34.0    229
 60.0    227
        ... 
-12.0      2
 80.0      2
-13.0      1
-18.0      1
 82.0      1
Name: TMIN, Length: 95, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: WDF2

Number of unique values: 36 

[240 250  60 270 280 220 260 230 350 360 190 200 290  30 310 210 330  40
  50 100 300 180 340  80  90 320  70 160  20 130 140 170 120 110 150  10]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

280    519
290    519
240    492
300    485
180    466
170    449
190    446
220    442
330    436
310    429
230    421
320    412
340    387
250    370
270    368
210    358
350    354
200    333
260    331
160    325
360    278
30     243
20     242
150    241
40     225
140    189
60     158
10     157
50     148
130    129
70     123
100    122
120    110
110    108
90      94
80      51
Name: WDF2, dtype: int64
___________________________


** ** ** ** ** ** ** ** ** ** 

Feature: WSF2

Number of unique values: 45 

[40.  29.1 15.  38.  31.1 21.9 30.  35.1 21.  25.1 16.1 18.1 19.9 10.1
  8.9 23.  23.9 13.  25.9 28.  14.1 17.  36.  12.1 33.1 42.9  8.1 32.
 40.9 36.9 38.9 46.1 45.  47.  49.   6.9 44.1 52.1 48.1  6.  67.1  4.9
 11.  51.  53.9]
_ _ _ _ _ _ _ _ _ _ _ _ _ _

16.1    815
17.0    777
18.1    775
21.0    764
15.0    758
19.9    718
14.1    672
21.9    670
13.0    586
23.0    546
23.9    510
12.1    456
25.1    386
25.9    361
10.1    299
28.0    295
29.1    251
30.0    217
8.9     177
31.1    169
32.0    151
8.1      99
33.1     90
35.1     88
36.0     75
36.9     47
38.0     42
6.9      38
38.9     26
40.0     26
40.9     21
42.9     14
6.0      13
44.1      9
49.0      5
47.0      4
45.0      2
46.1      1
52.1      1
48.1      1
67.1      1
4.9       1
11.0      1
51.0      1
53.9      1
Name: WSF2, dtype: int64
___________________________

None

The method called below shows the instances with missing records, grouped by the feature in which the value is missing.
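A minimal stand-in for `show_missing` with plain pandas (a sketch; the actual `edapy` implementation may differ):

```python
import numpy as np
import pandas as pd

def show_missing(df: pd.DataFrame, *columns: str) -> None:
    """For each named feature, print the rows in which that feature is missing."""
    for col in columns:
        print(f"Feature name: {col}")
        print(df[df[col].isna()])
```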

In [7]:
list_names = data.get_column_names_where_missing_data()
print(list_names)
data.show_missing(*list_names)
Index(['AWND', 'PRCP', 'TAVG', 'TMAX', 'TMIN'], dtype='object')

Feature name: AWND
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND PRCP SNOW SNWD TAVG TMAX TMIN WDF2 WSF2
6049 USW00004725 BINGHAMTON GREATER AP, NY US 42.20678 -75.97993 485.7 2019-07-24 NaN 0.00 0.0 0.0 69.0 75.0 55.0 360 13.0
7349 USW00004781 ISLIP LI MACARTHUR AIRPORT, NY US 40.79389 -73.10181 25.4 2017-02-12 NaN 0.04 0.4 9.8 34.0 40.0 30.0 10 6.9
8170 USW00004781 ISLIP LI MACARTHUR AIRPORT, NY US 40.79389 -73.10181 25.4 2019-05-14 NaN 0.07 0.0 0.0 47.0 52.0 44.0 30 10.1
Feature name: PRCP
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND PRCP SNOW SNWD TAVG TMAX TMIN WDF2 WSF2
3734 USW00014735 ALBANY INTERNATIONAL AIRPORT, NY US 42.74722 -73.79913 85.4 2019-03-23 21.03 NaN 0.4 0.0 32.0 37.0 28.0 300 36.0
Feature name: TAVG
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND PRCP SNOW SNWD TAVG TMAX TMIN WDF2 WSF2
334 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-12-01 7.61 0.10 0.0 0.0 NaN 53.0 39.0 230 18.1
335 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-12-02 6.71 0.02 0.0 0.0 NaN 49.0 34.0 240 17.0
336 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-12-03 11.41 0.01 0.0 0.0 NaN 44.0 33.0 280 21.0
337 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-12-04 11.63 0.00 0.0 0.0 NaN 44.0 38.0 240 28.0
338 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-12-05 4.92 0.00 0.0 0.0 NaN 45.0 31.0 230 13.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2095 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2020-09-26 8.50 0.00 0.0 0.0 NaN 80.0 57.0 210 16.1
2096 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2020-09-27 12.75 0.00 0.0 0.0 NaN 82.0 65.0 210 25.9
2097 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2020-09-28 10.51 0.01 0.0 0.0 NaN 80.0 62.0 220 31.1
2098 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2020-09-29 8.50 0.41 0.0 0.0 NaN 67.0 54.0 250 25.1
2099 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2020-09-30 14.32 1.72 0.0 0.0 NaN 62.0 51.0 210 33.1

367 rows × 15 columns

Feature name: TMAX
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND PRCP SNOW SNWD TAVG TMAX TMIN WDF2 WSF2
5584 USW00004725 BINGHAMTON GREATER AP, NY US 42.20678 -75.97993 485.7 2018-04-15 17.9 0.0 0.0 0.0 38.0 NaN NaN 130 29.1
Feature name: TMIN
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND PRCP SNOW SNWD TAVG TMAX TMIN WDF2 WSF2
5584 USW00004725 BINGHAMTON GREATER AP, NY US 42.20678 -75.97993 485.7 2018-04-15 17.9 0.0 0.0 0.0 38.0 NaN NaN 130 29.1

3. Visualize missing values¶

The method call below provides visualizations of the missing data in the dataset. This may help detect and reveal correlations between missing values.

In [8]:
data.missing_bar_plot()
<AxesSubplot:>
<AxesSubplot:>
<AxesSubplot:>

Observation 3.01¶

From the above information we get a good idea of what data is missing and how to handle it.

  1. As we are working with data points over a period of time, and each instance represents data for a particular date, it would be ideal not to lose/drop any rows. Therefore we will explore other options. There are quite a few built-in solutions to this problem: backward fill, forward fill, and interpolation are a few such possibilities. However, we will use a custom approach to fill in the missing data.

  2. For the features AWND, PRCP, TMAX, and TMIN, we will leverage the fact that there are data points from several geographically close stations for the same date.

  3. To fill in missing values in the TAVG column, we will leverage the fact that each instance has data for the min and max temperature on the given date. Adding TMAX and TMIN and dividing the result by two gives a close approximation to the actual value.

  4. From the above heatmap of the missing data, it is clear that TMIN and TMAX are the only pair of features where, whenever one is missing, so is the other.

4. Handle/replace missing values¶

Handle missing values in AWND, PRCP, TMAX, and TMIN

In [9]:
# get list of features with missing values, without TAVG
list_names = list_names.drop('TAVG')
list_names
Out[9]:
Index(['AWND', 'PRCP', 'TMAX', 'TMIN'], dtype='object')

The method call below replaces missing values for the features in the list with the average value across all stations for that date.

average = (x_1 + x_2 + ... + x_i) / i

In [10]:
data.average_fill_in_missing(list_names)
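A minimal sketch of this fill in plain pandas, assuming a long-format DataFrame `df` with one row per station per date (the actual `average_fill_in_missing` implementation may differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DATE": ["2015-01-01", "2015-01-01", "2015-01-01"],
    "AWND": [10.0, np.nan, 14.0],
})
# each missing value becomes the mean of the remaining stations on the same date
df["AWND"] = df["AWND"].fillna(df.groupby("DATE")["AWND"].transform("mean"))
print(df["AWND"].tolist())  # [10.0, 12.0, 14.0]
```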

The call below replaces missing TAVG values with the average of two other features:

TAVG = (TMAX + TMIN) / 2

In [11]:
data.half_sum_fill_in_missing('TAVG')
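A sketch of the same fill in plain pandas (the actual `half_sum_fill_in_missing` implementation may differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"TMAX": [53.0], "TMIN": [39.0], "TAVG": [np.nan]})
# approximate a missing average as the midpoint of the daily min and max
df["TAVG"] = df["TAVG"].fillna((df["TMAX"] + df["TMIN"]) / 2)
print(df["TAVG"].iloc[0])  # 46.0
```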

Check to confirm data adjustments

In [12]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10960 entries, 0 to 10959
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   STATION    10960 non-null  object        
 1   NAME       10960 non-null  object        
 2   LATITUDE   10960 non-null  float64       
 3   LONGITUDE  10960 non-null  float64       
 4   ELEVATION  10960 non-null  float64       
 5   DATE       10960 non-null  datetime64[ns]
 6   AWND       10960 non-null  float64       
 7   PRCP       10960 non-null  float64       
 8   SNOW       10960 non-null  float64       
 9   SNWD       10960 non-null  float64       
 10  TAVG       10960 non-null  float64       
 11  TMAX       10960 non-null  float64       
 12  TMIN       10960 non-null  float64       
 13  WDF2       10960 non-null  int64         
 14  WSF2       10960 non-null  float64       
dtypes: datetime64[ns](1), float64(11), int64(1), object(2)
memory usage: 1.3+ MB

4.1 Data Distribution & Outliers¶

The method call below plots a histogram of each feature in the dataset. From this, we can start to get an idea of how the data is distributed and so gain deeper insight.

In [13]:
my_plot.plot_hist(data.get_df())

The boxplots below show how the data is distributed over the possible range of values per feature and give insight into outlier values.

In [14]:
my_plot.plot_outliers(data.get_df())
In [15]:
my_plot.plot_outlier_single_col_for_all(data.get_df())
<Figure size 432x288 with 0 Axes>

Observation 4.01¶

  1. While some values are considered outliers based on the box plots, such values are not unreasonable for weather conditions.

4.2 Split Dataset & Feature Engineering¶

For the next part some more feature engineering will be required.

  • Year, month, and day will be extracted from the DATE feature, along with the number of days elapsed since January 1 of that year (DAYS_SINCE_JAN_1)
In [16]:
# extract and add features YEAR, MONTH and DAY from DATE
data.feature_extraction_date()
# extract and add feature DAYS_SINCE_JAN_1
data.feature_extraction_days_since_jan_1()
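These extractions presumably rest on the pandas `.dt` accessor; a sketch (assuming a DataFrame `df` with a datetime DATE column):

```python
import pandas as pd

df = pd.DataFrame({"DATE": pd.to_datetime(["2015-01-01", "2020-12-31"])})
df["YEAR"] = df["DATE"].dt.year
df["MONTH"] = df["DATE"].dt.month
df["DAY"] = df["DATE"].dt.day
# days elapsed since Jan 1 of that year (Jan 1 itself -> 0)
df["DAYS_SINCE_JAN_1"] = df["DATE"].dt.dayofyear - 1
print(df)
```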
In [17]:
# confirm operation 
data.get_df()
Out[17]:
STATION NAME LATITUDE LONGITUDE ELEVATION DATE AWND PRCP SNOW SNWD TAVG TMAX TMIN WDF2 WSF2 YEAR MONTH DAY DAYS_SINCE_JAN_1
0 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-01 24.61 0.00 0.3 1.2 23.0 32.0 20.0 240 40.0 2015 1 1 0
1 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-02 13.20 0.00 0.0 1.2 31.0 33.0 22.0 250 29.1 2015 1 2 1
2 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-03 8.50 0.58 0.0 0.0 27.0 48.0 22.0 60 15.0 2015 1 3 2
3 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-04 17.67 0.52 0.0 0.0 44.0 55.0 28.0 250 38.0 2015 1 4 3
4 USW00014733 BUFFALO NIAGARA INTERNATIONAL, NY US 42.93998 -78.73606 216.2 2015-01-05 19.24 0.00 0.1 0.0 20.0 28.0 9.0 270 31.1 2015 1 5 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10955 USW00094789 JFK INTERNATIONAL AIRPORT, NY US 40.63915 -73.76390 2.7 2020-12-27 7.61 0.00 0.0 0.0 29.0 38.0 24.0 280 19.9 2020 12 27 361
10956 USW00094789 JFK INTERNATIONAL AIRPORT, NY US 40.63915 -73.76390 2.7 2020-12-28 10.96 0.00 0.0 0.0 41.0 51.0 33.0 270 19.9 2020 12 28 362
10957 USW00094789 JFK INTERNATIONAL AIRPORT, NY US 40.63915 -73.76390 2.7 2020-12-29 16.11 0.00 0.0 0.0 40.0 43.0 28.0 330 29.1 2020 12 29 363
10958 USW00094789 JFK INTERNATIONAL AIRPORT, NY US 40.63915 -73.76390 2.7 2020-12-30 9.62 0.00 0.0 0.0 33.0 43.0 24.0 210 19.9 2020 12 30 364
10959 USW00094789 JFK INTERNATIONAL AIRPORT, NY US 40.63915 -73.76390 2.7 2020-12-31 11.18 0.36 0.0 0.0 44.0 48.0 35.0 330 21.0 2020 12 31 365

10960 rows × 19 columns

Motivation: splitting the dataset by station may be beneficial for some comparisons and helps avoid repetitive code.

In [18]:
# get unique values in STATION feature 
list_of_stations = data.get_unique('STATION')
In [19]:
# make list of df split based on STATION name 
df_list_station = data.get_list_dfs(list_of_stations, 'STATION')

Motivation: further splitting each station's dataset by year may be beneficial for some comparisons and helps avoid repetitive code.

In [20]:
# gets a list of lists: each element at level i is one station's df, 
# and each element at level j is that station's df split on year 
df_i_station_j_years = data.get_list_of_lists_df_split(df_list_station, 'YEAR')
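The two splits can be sketched with list comprehensions, as a minimal stand-in for the `get_list_dfs` / `get_list_of_lists_df_split` helpers (assuming a DataFrame `df`; the actual implementations may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "STATION": ["A", "A", "B", "B"],
    "YEAR": [2015, 2016, 2015, 2016],
})
stations = df["STATION"].unique()
# level i: one df per station; level j: that station's df split per year
df_list_station = [df[df["STATION"] == s] for s in stations]
df_i_station_j_years = [
    [d[d["YEAR"] == y] for y in d["YEAR"].unique()] for d in df_list_station
]
print(len(df_list_station), len(df_i_station_j_years[0]))
```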

5. Visual Analysis¶

Correlation Between Data¶

In [21]:
# Spearman correlation for all features except NAME, STATION and DATE 
my_plot.plot_heat_map_per_feature(data.get_df(), 'spearman')
<Figure size 432x288 with 0 Axes>
In [22]:
my_plot.heat_map(data.get_df())
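Spearman correlation is rank-based, which makes it robust to the outlier-like weather values noted earlier. A sketch of how the underlying matrix is computed with pandas (the heat map itself is presumably a `seaborn.heatmap` drawn over this matrix):

```python
import pandas as pd

# toy values for illustration; both columns rise and fall together
df = pd.DataFrame({"TMIN": [20.0, 22.0, 28.0, 9.0], "TMAX": [32.0, 33.0, 48.0, 15.0]})
corr = df.corr(method="spearman")  # correlation of ranks, not raw values
print(corr)
```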

Observation 5.01¶

  1. The above heat maps show strong correlations between certain key features. While some of these relations are expected, others are quite interesting.

    Strong correlations:

    • AWND and WSF2
    • TAVG and TMIN, TAVG and TMAX, and TMAX and TMIN are obvious ones

Min & Max Temperature analysis¶

The plots below show minimum and maximum temperature over the period, split by STATION.

In [23]:
for df_section in df_list_station:
    my_plot.plot_temp_high_low(df_section)

The plot below shows minimum and maximum temperature and their relation, color-coded by data gathering station.

In [24]:
my_plot.join_plot_hue(data.get_df(), 'TMIN', 'TMAX', 'NAME')

The plot below shows TAVG over the time period, using a more generic custom plot method as an alternative to the one used higher up in this notebook for TMIN and TMAX over time.

In [25]:
feature_list = ['TAVG']

for df_section in df_list_station:
    my_plot.plot_many(df_section, 'DATE', 'NAME', feature_list, 'Temperature in degree F°')
In [26]:
my_plot.join_plot_hue(data.get_df(), 'DATE', 'TAVG', 'NAME')

The plot below shows average temperature over a year's span for each year in the df, one plot per data gathering station.

In [27]:
my_plot.plot_list_of_list(df_i_station_j_years, 'DAYS_SINCE_JAN_1', 'TAVG', 'YEAR', 'NAME', 
                  'Passed days since Jan 1st', 'Average temperature (F°)')

Similar to the plots above, but with a cumulative sum over the period.

In [28]:
my_plot.plot_list_of_list(df_i_station_j_years, 'DAYS_SINCE_JAN_1', 'TAVG', 'YEAR', 'NAME', 
                  'Passed days since Jan 1st', 'Average temperature (F°)', 'cumsum')

Observation 5.02¶

From the above plots, an overall trend is observed as time progresses the difference between min and max temperature decreases, and both those values increase. Therefore, a trend of increase in temperature is present.

Precipitation¶

In [29]:
feature_list = ['PRCP']

for df_section in df_list_station:
    my_plot.plot_many(df_section, 'DATE', 'NAME', feature_list, 'Precipitation in inches')
In [30]:
my_plot.join_plot_hue(data.get_df(), 'DATE', 'PRCP', 'NAME')

Precipitation over a year's period for all years in the dataset, split by station¶

The plot below shows precipitation over a year's period for each year in the df, one plot per data gathering station.

In [31]:
# plot precipitation over a year period for each year in the df.
# One plot per data gathering station 
my_plot.plot_list_of_list(df_i_station_j_years, 'DAYS_SINCE_JAN_1', 'PRCP', 'YEAR', 'NAME', 
                  'Passed days since Jan 1st', 'Precipitation (inch)')

Similar to the plots above, but with a cumulative sum over the period.

In [32]:
my_plot.plot_list_of_list(df_i_station_j_years, 'DAYS_SINCE_JAN_1', 'PRCP', 'YEAR', 'NAME', 
                  'Passed days since Jan 1st', 'Precipitation (inch)', 'cumsum')
In [33]:
only_first_last_years = [inner[::len(inner)-1] for inner in df_i_station_j_years]
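The step slice `[::len(inner) - 1]` in the cell above keeps only the first and last element of each inner list, i.e. each station's first and last year:

```python
# a step of len - 1 visits only index 0 and index len - 1
years = ["2015", "2016", "2017", "2018", "2019", "2020"]
print(years[::len(years) - 1])  # ['2015', '2020']
```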
In [34]:
my_plot.plot_list_of_list(only_first_last_years, 'DAYS_SINCE_JAN_1', 'PRCP', 'YEAR', 'NAME', 
                  'Passed days since Jan 1st', 'Precipitation (inch)')
my_plot.plot_list_of_list(only_first_last_years, 'DAYS_SINCE_JAN_1', 'PRCP', 'YEAR', 'NAME', 
                  'Passed days since Jan 1st', 'Precipitation (inch)', 'cumsum')

Fastest 2-minute wind speed & Average daily wind speed¶

Note: Here we can make observations about how windy a particular part of the state is.

Fastest 2-minute wind speed in relation to Average daily wind speed. Color-coding based on data gathering station.

In [35]:
my_plot.join_plot_hue(data.get_df(), 'AWND', 'WSF2', 'NAME')

Experimental Work Section¶

Geo Map¶

Plotly implementation of map¶

In [36]:
import json
import plotly.express as px
In [37]:
# get geo data from file
ny_state = json.load(open("data/geo/cb_2018_36_place_500k.geojson", 'r'))
In [38]:
# dict for the key id pairs
state_id_map = {}

# extract id
for feature in ny_state['features']:
    feature['id'] = feature['properties']['GEOID']
    state_id_map[feature['properties']['NAME']] = feature['id']

Experimental work

By-hand encoding for the Geo map with Plotly:

  • BUFFALO NIAGARA INTERNATIONAL, NY US - Niagara
  • ALBANY INTERNATIONAL AIRPORT, NY US - Albany
  • BINGHAMTON GREATER AP, NY US - Binghamton
  • ISLIP LI MACARTHUR AIRPORT, NY US - Islip
  • JFK INTERNATIONAL AIRPORT, NY US - Nassau
In [39]:
# get county names
data.get_df()['COUNTY'] = data.get_df()['NAME'].apply(lambda x: x.split(" ")[0])

# clean up county names and match to ones from the geo data file
data.get_df()['COUNTY'] = data.get_df()['COUNTY'].apply(lambda x: 'Niagara Falls' if x == 'BUFFALO' 
                                                        else 'New York' if x == 'JFK' else x.capitalize())
# get ids based on dict 
data.get_df()['id'] = data.get_df()['COUNTY'].apply(lambda x: state_id_map[x])
In [40]:
# add the mean TAVG per COUNTY: sum of all TAVG values divided by the
# number of records for that county

list_of_counties = data.get_df()['COUNTY'].unique()

for county in list_of_counties:
    tavg = data.get_df().loc[data.get_df()['COUNTY'] == county, 'TAVG'].sum()
    number_of_records = data.get_df().loc[data.get_df()['COUNTY'] == county, 'TAVG'].count()
    sum_over_count = tavg / number_of_records
    data.get_df().loc[data.get_df()['COUNTY'] == county, 'SUM_TAVG'] = sum_over_count
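A `groupby`/`transform` would do the same per-county averaging in one step; a sketch on toy data (the county names and temperatures here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "COUNTY": ["Albany", "Albany", "Islip"],
    "TAVG": [40.0, 50.0, 60.0],
})
# mean TAVG per county, broadcast back onto every row of that county
df["SUM_TAVG"] = df.groupby("COUNTY")["TAVG"].transform("mean")
print(df["SUM_TAVG"].tolist())  # [45.0, 45.0, 60.0]
```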
In [41]:
fig = px.choropleth(data.get_df(), 
                    locations='id', 
                    geojson=ny_state, 
                    color='SUM_TAVG', 
                    scope='usa', 
                    hover_name='NAME',
                    hover_data=['SUM_TAVG'])

fig.update_geos(fitbounds="locations", visible=True)
fig.show()
In [42]:
center_lon = data.get_df().loc[data.get_df()['COUNTY'] == 'Binghamton', 'LONGITUDE'].tolist()[0]
center_lat = data.get_df().loc[data.get_df()['COUNTY'] == 'Binghamton', 'LATITUDE'].tolist()[0]

fig = px.choropleth_mapbox(data.get_df(), 
                            locations='id',
                            geojson=ny_state, 
                            color='SUM_TAVG', 
                            hover_name='NAME',
                            hover_data=['SUM_TAVG'],
                            mapbox_style='carto-positron',
                            zoom = 5.8,
                            center={'lat': center_lat, 'lon': center_lon})
fig.show()

Ending remarks¶

  • Through our observations, we have found that over the six-year period in New York State, both minimum and maximum temperature and precipitation increased, as did the cumulative average. This is in line with our original hypothesis stated at the beginning of this study.
  • We can also conclude that certain parts of the state tend to be consistently warmer on average, even when geographically close to each other. The warmest of these stations on average has been JFK.
  • We consistently see that Albany, specifically, experiences higher maximums and lower minimums than the other stations.

Future work¶

  • Future studies may include further exploration of features that are not obviously related but still show a sufficient correlation.
  • Further exploration of the relation between precipitation and temperature may be fruitful, including the effect that elevation has on these readings and the strength of the correlation.
  • Both the topic and the dataset appear well suited for time series analysis aimed at forecasting values for future time periods.
  • It would be beneficial to continue gathering data, including a weather station from each county of New York state, to gain a more holistic view of weather patterns statewide.
  • This study could also be expanded to include similar case studies of different areas of the world over the same six-year period, to compare rate of temperature and precipitation increase on a world scale.
In [ ]: